Improving the Morphological Analysis of Classical Sanskrit
Abstract
The paper describes a new tagset for the morphological disambiguation of Sanskrit and compares the accuracy of two machine learning methods (CRF, deep recurrent neural networks) for this task, with a special focus on how to model the lexicographic information. It reports a significant improvement over previously published results.

1 Challenges of Sanskrit Linguistics and Related Research

Classical Sanskrit is a strongly inflecting Old Indo-Aryan language that developed out of earlier Vedic dialects in the middle of the first millennium BCE. Ever since, Sanskrit has been the main medium for transmitting the large corpus of religious, philosophical, scientific, and literary texts that shaped the intellectual history of ancient India.

Sanskrit poses considerable challenges for NLP at the levels of tokenization, lemmatization, and morphological analysis (Kulkarni and Shukla, 2009). These three steps are deeply intertwined in Sanskrit, because single word forms (padas) are merged into larger strings by a set of phonetic rules called Sandhi (“connection”). In order to analyze a sentence at the morphological and lexical level, an NLP tool must simultaneously resolve the Sandhis and detect the correct morphological and lexical path in the resulting lattice of word hypotheses. As a consequence, the tokenization of a sentence is guided by its lexical and morphological analyses. Due to these linguistic peculiarities, morphological ambiguity is introduced on three levels:

Inherent: Isolated Sanskrit forms are frequently ambiguous. The verbal form gacchati, for example, has three readings: 3rdSG.PR of the verb gam ‘to go’ (“(s)he / it goes”), L.SG.M. of the present participle of this verb (“in the going [some referent]”), and L.SG.N. of the same participle.

Sandhi: When morphologically unambiguous forms such as draupadī (N.SG.F. of draupadī ‘name of a woman’) are processed with Sandhi rules, they can become ambiguous. While the sentence draupadī gacchati ‘Draupadī goes’ allows only one reading of draupadī, the sentence draupadī āgacchati ‘Draupadī arrives’ is further processed by the Sandhi rule ī + ā = yā, resulting in draupadyāgacchati. When this string is analyzed with an NLP tool, the sequence -yā- can be resolved into (1) the “correct” source phonemes ī + ā, but also into (2) i + ā, (3) ya + a, (4) ya + ā, (5) yā + a, or (6) yā + ā, where solutions (1), (2), (5), and (6) represent lexico-morphologically, but not necessarily semantically, valid readings.[1] The morphological analyzer (MA) has to decide between three readings, draupadī (N.SG.), draupadi (V.SG.), and draupadyā (I.SG.), which are distinct in their un-Sandhied, phonetically disambiguated forms.

bahuvrīhi compounds: Sanskrit has a highly productive class of compounds called bahuvrīhis (“much rice”), which form possessive expressions. Compounds of this class behave like adjectives, because they inherit the inflectional information from their governing possessors. While the non-possessive

[1] (2) “O Draupadī, he/she/it comes”; (5*) “With Draupadī ... in the not-going”; (6) “He/she/it arrives together with Draupadī”

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
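The Sandhi example above can be sketched in a few lines of Python. This is only an illustration of how the six candidate resolutions of -yā- expand into a small lattice of word hypotheses, not the analyzer described in the paper. The split table is copied from the enumeration in the text; the function names and the two toy lexicons (which hold only the forms mentioned in the text and its footnote) are assumptions made for this sketch.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that precomposed and combining diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


# Candidate resolutions of the surface sequence "yā", numbered as in the text.
YA_SPLITS = [
    ("ī", "ā"),   # (1) the rule that actually produced the string
    ("i", "ā"),   # (2)
    ("ya", "a"),  # (3)
    ("ya", "ā"),  # (4)
    ("yā", "a"),  # (5)
    ("yā", "ā"),  # (6)
]

# Hypothetical toy lexicons: only the forms named in the text and its footnote.
LEFT_FORMS = {nfc(form): tag for form, tag in [
    ("draupadī", "N.SG."),
    ("draupadi", "V.SG."),
    ("draupadyā", "I.SG."),
]}
RIGHT_FORMS = {nfc("āgacchati"), nfc("agacchati")}


def candidate_splits(surface, junction="yā"):
    """Return (solution number, left form, right form) for every split of the junction."""
    surface, junction = nfc(surface), nfc(junction)
    pos = surface.find(junction)
    if pos < 0:
        return []
    prefix, suffix = surface[:pos], surface[pos + len(junction):]
    return [
        (number, nfc(prefix + left), nfc(right + suffix))
        for number, (left, right) in enumerate(YA_SPLITS, start=1)
    ]


def valid_readings(surface):
    """Keep only the splits whose two halves are both attested in the toy lexicons."""
    return [
        (number, left, LEFT_FORMS[left], right)
        for number, left, right in candidate_splits(surface)
        if left in LEFT_FORMS and right in RIGHT_FORMS
    ]


if __name__ == "__main__":
    for reading in valid_readings("draupadyāgacchati"):
        print(reading)
```

With these toy lexicons, the filter keeps solutions (1), (2), (5), and (6) and discards (3) and (4), because draupadya is not a listed form. Choosing among the surviving readings is the disambiguation task that the morphological analyzer has to solve.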
Similar resources
Design of a lean interface for Sanskrit corpus annotation
We describe an innovative computer interface designed for assisting annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpus. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting on the sandhi rules used, and aligning on the input sente...
A Collaborative Platform for Sanskrit Processing
Sanskrit, the classical language of India, presents specific challenges for computational linguistics: exact phonetic transcription in writing that obscures word boundaries, rich morphology and an enormous corpus, among others. Recent international cooperation has developed innovative solutions to these problems and significant resources for linguistic research. Solutions include efficient segm...
A Distributed Platform for Sanskrit Processing
Sanskrit, the classical language of India, presents specific challenges for computational linguistics: exact phonetic transcription in writing that obscures word boundaries, rich morphology and an enormous corpus, among others. Recent international cooperation has developed innovative solutions to these problems and significant resources for linguistic research. Solutions include efficient segm...
Sanskrit as a Programming Language and Natural Language Processing
This paper presents work toward developing a dependency parser for the Sanskrit language and also describes efforts in developing NLU (Natural Language Understanding) and NLP (Natural Language Processing) systems. Here, we use the Ashtadhyayi (a book of Sanskrit grammar) to implement this idea. We use this concept because Sanskrit is an unambiguous language. In this paper, we are pre...
Efficient Recognition of Telugu Characters Based on Critical Points Generated Using Morphological Methods
A novel method for the recognition of Telugu characters is proposed in this paper. The proposed method extracts critical points of the characters based on grid and radial intersection analysis. The extracted critical points are classified based on the grid and radial lines, which helps improve recognition accuracy. The algorithm is tested on various data sets and the...
Publication date: 2016